.. _system_preparation:

Preparing systems for constant-pH simulation
********************************************

Several steps are necessary to prepare a simulation system for simulation using ``protons``.
Most of ``protons`` relies on the `OpenMM API for Amber files`_. In order to run constant-pH simulations, the ``protons`` package requires input files generated by several programs
from the Ambertools suite.

These instructions assume that you have a ``.pdb`` file of your system that has all necessary components to start an MD simulation.
If you need help adding missing residues, atoms, or other features to your system, consider reading the OpenMM Modeller_ documentation, or the `Amber manual`_.

Using the examples shown below, we will demonstrate how to generate input for an implicit solvent constant-pH simulation, including

* An amber topology file (``.prmtop``), which contains the parameters and topology of the system.
* A coordinate file (``.inpcrd``), which contains the 3-dimensional coordinates of your system.
* A protonation state file (``.cpin``), which enumerate protonation states of amino acids.

.. _Modeller: http://docs.openmm.org/7.0.0/userguide/application.html#model-building-and-editing
.. _OpenMM API for Amber files: http://docs.openmm.org/7.0.0/userguide/application.html#using-amber-files
.. _Amber manual: http://ambermd.org/doc12/Amber16.pdf


Obtaining and using Ambertools
==============================

Ambertools provides a series of command-line utilities that aid in setting up simulation systems.
We will be using these for the manipulation of ``.pdb`` and ``.mol2`` files, and for generating parameters for small
molecules using the general Amber force field (GAFF).
The instructions below have been tested with Ambertools16.

Ambertools16 can be obtained from the `Amber website`_.

For any instructions relating to Ambertools programs, we refer to the `Amber manual`_.
This includes instructions on how to install Ambertools.

After following the installation procedure, please make sure you can access the following programs on your command-line:

* ``tleap``,
* ``cpinutil.py``,

and if you're planning to use small molecules:

* ``antechamber``,
* ``parmchk``.

We will be using these to prepare systems for constant-pH simulation in OpenMM.

.. _Amber website: http://ambermd.org/AmberTools16-get.html.

Fixing residue names and atoms
==============================

Several modifications will need to be made to your input file, in order to rename residues, and prevent issues with preexisting hydrogen atoms.

Renaming residues
-----------------

One of the first steps of preparing an input file for constant-pH simulation is to rename all residues in the pdb file to their most protonated form.

The shortlist for PDB residues that our code supports is as follows:
 * Histidine (HIP)
 * Aspartic acid (AS4)
 * Glutamic acid (GL4)
 * Cysteine (CYS)
 * Lysine (LYS)
 * Arginine (ARG)

 .. Note::
   You may also need to rename any cysteine residues involved in disulfide bonds from
   ``CYS → CYX``, for further processing steps.


Treating hydrogens
------------------

To properly add all necessary hydrogens to the system, it is easiest to first delete all current hydrogens in the system.
Otherwise, improperly named hydrogens may cause issues when defining the topology of the system based on the ``.pdb`` file.
Missing hydrogens will be added back in the next preparation step, so it is safe to delete them at this stage.


An example bash script for processing ``.pdb`` files
----------------------------------------------------

Calling the following shell script on your pdb file should help you prepare your ``.pdb`` file for simulation.
It replaces the names of residues in-place in the file, and it removes hydrogens from your file.

.. code-block:: bash

    pdbfile="input.pdb"

    # Rename to constph residues
    sed -i 's/HIE/HIP/g' ${pdbfile}
    sed -i 's/HID/HIP/g' ${pdbfile}
    sed -i 's/HIS/HIP/g' ${pdbfile}
    sed -i 's/ASP/AS4/g' ${pdbfile}
    sed -i 's/GLU/GL4/g' ${pdbfile}

    # Remove hydrogens (print lines with atom names not starting with H)
    awk '$3 !~ /^ *H/' ${pdbfile} > tmp && mv tmp ${pdbfile}

Tools like ``sed`` and ``awk`` can be very useful for manipulating text files quickly,
though I recommend that you verify the output manually afterwards.


Writing out coordinate and parameter files using Leap
=====================================================

The next step in system preparation is generating the Amber coordinate (``.inpcrd``), and  the topology and parameters files (``.prmtop``).
This can be done using the command-line utility ``tleap``, which is available as part of ambertools.

These instructions will suffice if your system does not include any small molecules.
If your system contains a ligand, please see the `next section`_.

You can use ``tleap`` interactively on the command line. Just type

.. code-block:: bash

    tleap

And you will see output similar to this

.. code-block:: c

    -bash-4.1$ tleap
    -I: Adding /home/user/bin/../dat/leap/prep to search path.
    -I: Adding /home/user/bin/../dat/leap/lib to search path.
    -I: Adding /home/user/bin/../dat/leap/parm to search path.
    -I: Adding/home/user/bin/../dat/leap/cmd to search path.

    Welcome to LEaP!
    (no leaprc in search path)
    > █

You can start typing your commands line by line.
Alternatively, you can store commands in a text file, and then use

.. code-block:: bash

    tleap -f tleap.txt

and tleap will run the specified commands automatically.
Tleap output can be rather verbose.
It is recommended to write the output to file, so you can verify that all steps were executed correctly.

Here is a bash example:

.. code-block:: bash

    tleap -f tleap.in >> tleap.out 2>&1

You can rename the ``.out`` file to anything of your choosing.

Tleap commands
--------------

The following sequence of commands should do for a simple pdb file containing one protein structure.

.. code-block:: c

    # Load constant ph parameters
    source leaprc.constph

    # Load the PDB file, rename it to your input file
    protein = loadPDB input.pdb

    # Validate the input
    check protein

    # Calculate the total charge, for logging purposes
    charge protein

    # Write parameters.
    saveAmberParm protein complex.prmtop complex.inpcrd

    # Write PDB files, optional
    savepdb protein complex.pdb

    # Exit, make sure not to forget this part
    quit

If you need to perform other steps to prepare your system for simulation, please read the `Amber manual`_.


Validating tleap results
------------------------

If you run interactively, tleap should provide error messages on screen.
The output can be rather verbose, so make sure that your terminal is configured to scroll back far.

Alternatively, if you run using an input file, make sure that ``tleap`` ran successfully.

I often write output to a log file, and check the log file for errors.
Here is a short bash snippet that does the trick.

.. code-block:: bash

    tleap -f tleap.in >> tleap.out 2>&1

    # There might be other error clues. This method isn't fail safe.
    tleap_result=$(grep "usage" tleap.out || grep -i "error" tleap.out)

    # As long as the grep results are empty
     if [ -z "$tleap_result" ]
     then
       echo -e "\e[32mTleap looks successful. Still, act cautious. She's a slippery one.\e[39m"
     else
       echo -e "\e[31mCaught an error in Tleap. Tough luck, buddy.\e[39m"
       echo $tleap_result
     fi


This procedure generates three different files:

* ``complex.prmtop``, an Amber topology file which contains the topology and parameters of the protein system.
* ``complex.inpcrd``, a file containing the coordinates of all atoms in the system
* ``complex.pdb``, this file is optional. You can use a ``pdb`` file in software such as PyMOL_, to verify that the prepared structure doesn't contain mistakes.

You will be needing these to run your OpenMM script.

.. _PyMOL: http://pymol.org/

Including ligands in your system
================================

.. _next section:

.. warning::

    * Ligand support is a work in progress. We've experienced system instability with small molecules in implicit solvent simulations.


If you have a ligand, you will have to prepare your ligand using ``antechamber``, and ``parmchk``.
This is used to generate two files

* ``ligand.gaff.mol2``, a mol2 file with GAFF atom types.
* ``ligand.gaff.frcmod``, an frcmod file with GAFF parameters for the ligand.

Here is an example of how to run ``antechamber`` and ``parmchk``.

.. code-block:: bash

    antechamber -i ligand.mol2 -fi mol2 -o ligand.gaff.mol2 -fo mol2
    parmchk -i ligand.gaff.mol2 -o ligand.gaff.frcmod  -f mol2

You may wish to explore the advanced options of ``antechamber`` if you need to generate charges for your ligands.
If you want to generate charges in another program, using a ``.mol2`` file should allow you to maintain those charges.
Now that you've generated parameters for your ligand, these files then need to be added to your leap setup.

Here is an example leap script.


.. code-block:: c

    # Load constant ph parameters
    source leaprc.constph

    # Gaff params
    source leaprc.gaff

    # Load ligand parameters
    ligand = loadMol2 ligand.gaff.mol2
    loadAmberParams ligand.gaff.frcmod

    # Load the PDBs
    protein = loadPDB protein.pdb

    # Combine into one complex
    complex = combine { protein ligand }

    # Validate the input
    check complex

    # Calculate the total charge, for logging purposes
    charge complex

    # Write parameters.
    saveAmberParm  complex  complex.prmtop complex.inpcrd

    # Write PDB files
    savepdb protein complex.pdb

    # Exit, make sure not to forget this part
    quit


.. todo::
    * In the current version of the code, ligands can not be treated using constant-pH methodologies.


Generating parameters for amino acid protonation states
=======================================================

The last step to generate input for the constant-pH simulation is to generate a ``.cpin`` file for your protein.
This file contains the parameters for the different protonation states of the amino acids in the system.

A ``.cpin`` file can be generated by ``cpinutil.py``, which is also distributed as part of Ambertools.

.. code-block:: bash

    cpinutil.py -resnames HIP GL4 AS4 TYR LYS CYS -p complex.prmtop -o complex.cpin